In [8]:
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy import stats

1. Map Visualization

1. Let's first get a general sense of how amenities are distributed in the City of Vancouver

In [11]:
van_osm_data = pd.read_json('../DataProcessing/processed/van_osm_data.json')
In [12]:
fig_neigh = px.scatter_mapbox(van_osm_data, lat="lat", lon="lon", color="neighborhood", 
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)
#fig.update_layout(mapbox_style="open-street-map")
fig_neigh.update_layout(mapbox_style="open-street-map",
                 margin = {'l':0, 'r':0, 'b':0, 't':0})
fig_neigh.show()

We can see that neighborhoods near Central Vancouver have a high density of amenities.

2. We further refined our OSM data so that it only includes neighborhoods that have more than 50 Airbnbs. Let's have a look at it.
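To back the visual impression with numbers, the points can be counted per neighborhood. This is a minimal sketch on a toy frame: the rows and the neighborhood names other than Central Vancouver are made up for illustration, and only the `neighborhood`, `lat` and `lon` column names come from the cells above.

```python
import pandas as pd

# Toy stand-in for van_osm_data (hypothetical rows, same column names).
toy_osm = pd.DataFrame({
    "neighborhood": ["Central Vancouver", "Central Vancouver", "Central Vancouver",
                     "Neighborhood B", "Neighborhood B", "Neighborhood C"],
    "lat": [49.282, 49.283, 49.281, 49.268, 49.267, 49.210],
    "lon": [-123.120, -123.118, -123.121, -123.160, -123.158, -123.130],
})

# Number of amenity points per neighborhood, largest first.
amenity_counts = toy_osm["neighborhood"].value_counts()
print(amenity_counts)
```

On the real data, the same call on `van_osm_data["neighborhood"]` would give the per-neighborhood totals.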

In [13]:
refined_van_osm_data = pd.read_json('../DataProcessing/processed/van_osm_data_refined.json')
In [14]:
fig_neigh = px.scatter_mapbox(refined_van_osm_data, lat="lat", lon="lon", color="neighborhood", 
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)
#fig.update_layout(mapbox_style="open-street-map")
fig_neigh.update_layout(mapbox_style="open-street-map",
                 margin = {'l':0, 'r':0, 'b':0, 't':0})
fig_neigh.show()
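The refinement itself happens in the DataProcessing step and is not shown in this notebook. One way such a filter could look is sketched below, assuming the threshold of 50 from the text and using toy frames whose contents (including the `amenity` column) are hypothetical.

```python
import pandas as pd

MIN_AIRBNBS = 50  # threshold stated in the text above

# Toy stand-ins for the Airbnb and OSM frames (hypothetical contents).
toy_airbnb = pd.DataFrame({"neighborhood": ["A"] * 60 + ["B"] * 10})
toy_osm = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "C"],
    "amenity": ["cafe", "bus_station", "pub", "cinema"],
})

# Count listings per neighborhood and keep those above the threshold.
listing_counts = toy_airbnb["neighborhood"].value_counts()
kept = listing_counts[listing_counts > MIN_AIRBNBS].index

# Keep only amenities located in the retained neighborhoods.
refined_osm = toy_osm[toy_osm["neighborhood"].isin(kept)]
print(refined_osm)
```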

3. Total number of amenities in each neighborhood

In [15]:
neighbor_data = pd.read_csv('../DataProcessing/processed/neighborhood_info.csv')
In [16]:
plt.figure(figsize=(20,15))
sns.barplot(x="neighborhood", y="total_amenity", data=neighbor_data)
plt.xticks(rotation=60, ha='right')
plt.title('Number of amenities in each neighborhood', fontsize=20)
plt.show()

We can also see how each category of amenities is distributed across Vancouver.

In [17]:
fig_neigh = px.scatter_mapbox(refined_van_osm_data, lat="lat", lon="lon", color="category", 
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)
#fig.update_layout(mapbox_style="open-street-map")
fig_neigh.update_layout(mapbox_style="open-street-map",
                 margin = {'l':0, 'r':0, 'b':0, 't':0})
fig_neigh.show()

We can see that sustenance (food) is more widely scattered across the city than the other three categories. Transportation is heavily concentrated in areas near Central Vancouver. The other two categories, arts and leisure, appear only in limited numbers.

4. Let's now have a look at the distribution of Airbnbs in the City of Vancouver
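The spread of each category can also be tabulated instead of read off the map. The sketch below uses a toy frame with the four category names mentioned above; the rows and the neighborhood labels are invented, and `pd.crosstab` is just one convenient way to get the counts.

```python
import pandas as pd

# Toy stand-in for refined_van_osm_data (hypothetical rows).
toy_osm = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "B", "C"],
    "category": ["sustenance", "transportation", "sustenance",
                 "sustenance", "arts", "leisure"],
})

# Rows: neighborhoods, columns: categories, cells: number of amenities.
counts_by_category = pd.crosstab(toy_osm["neighborhood"], toy_osm["category"])
print(counts_by_category)
```

A category that is spread out shows non-zero counts in many rows, while a concentrated one is dominated by a few rows.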

In [18]:
airbnb = pd.read_json('../DataProcessing/processed/airbnb_info.json')
In [19]:
fig_neigh = px.scatter_mapbox(airbnb, lat="lat", lon="lon", color="neighborhood", 
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)
#fig.update_layout(mapbox_style="open-street-map")
fig_neigh.update_layout(mapbox_style="open-street-map",
                 margin = {'l':0, 'r':0, 'b':0, 't':0})
fig_neigh.show()

First, we can see that the neighborhood categorization model works quite well for the Airbnb dataset. Airbnbs are separated into 18 clusters of closely located listings.

In [20]:
plt.figure(figsize=(20,15))
sns.countplot(x = 'neighborhood', data = airbnb, palette = 'magma')
plt.xticks(rotation=60, ha='right')
plt.title('Number of Airbnbs in each neighborhood', fontsize=20)
plt.show()

2. Statistical Tests

1. Relationship between the number of amenities and the number of Airbnbs in a neighborhood

Interestingly, it seems that neighborhoods with more amenities tend to have more Airbnbs.

We can test this hypothesis with a linear regression slope test.

Central Vancouver can be an outlier because it has many more amenities than the other neighborhoods. Therefore, we exclude Central Vancouver from our test dataset.

In [21]:
airbnb_count = airbnb.groupby('neighborhood').size().reset_index()
In [22]:
airbnb_count.rename(columns = {0:'total_airbnb'}, inplace=True)
In [23]:
neighbor_data = neighbor_data.join(airbnb_count.set_index('neighborhood'), on='neighborhood')
In [24]:
test_data = neighbor_data[neighbor_data['neighborhood']!='Central Vancouver']
In [25]:
plt.figure(figsize=(5,4))
sns.scatterplot(x="total_amenity", y="total_airbnb", data=test_data)
plt.title('Number of Amenity vs Number of Airbnb', fontsize=12)
plt.show()
In [26]:
reg = stats.linregress(test_data['total_amenity'], test_data['total_airbnb'])
print("p-value of linear regression slope test = ",reg.pvalue)
p-value of linear regression slope test =  2.8487185670130265e-05
Conclusion: the p-value of the regression slope test is much smaller than the usual α values (for example 0.05). Therefore, we can reject the null hypothesis that the slope is zero and conclude that there is a positive linear relationship between the number of amenities and the number of Airbnbs in a neighborhood.

2. Relationship between Airbnb score and other factors
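The p-value only says whether the slope differs from zero; the direction comes from the sign of the slope. The sketch below shows how both can be read from the `stats.linregress` result. The data are synthetic (seeded random noise around a line) and only reuse the column names as variable names, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the neighborhood-level data (not the real values).
rng = np.random.default_rng(0)
total_amenity = np.arange(10, 180, 10, dtype=float)
total_airbnb = 2.0 * total_amenity + rng.normal(0, 5, size=total_amenity.size)

reg = stats.linregress(total_amenity, total_airbnb)

alpha = 0.05  # assumed significance level
significant = reg.pvalue < alpha          # two-sided test of slope == 0
direction = "positive" if reg.slope > 0 else "negative"

print(f"slope = {reg.slope:.3f}, r = {reg.rvalue:.3f}, p-value = {reg.pvalue:.3g}")
print(f"significant at alpha={alpha}: {significant}, direction: {direction}")
```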

With all the data we have obtained, we can now explore the relationship between Airbnb score and other factors. Most of our data are not normally distributed, so our analysis relies on the central limit theorem: we work with the sample means of each neighborhood rather than with individual listings.

Let's first have a look at the average Airbnb score in each neighborhood
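The neighborhood-level means used in the tests are prepared in the data processing step. A minimal sketch of that kind of aggregation is shown below on a toy listing table; the `price` column name and all values are assumptions, while `avg_score`, `avg_price` and `score` mirror the column names used in this notebook.

```python
import pandas as pd

# Toy listing-level data (hypothetical values; `price` is an assumed column name).
toy_airbnb = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B"],
    "price": [100.0, 140.0, 80.0, 90.0, 100.0],
    "avg_score": [4.5, 4.7, 4.0, 4.2, 4.4],
})

# One row per neighborhood: the sample mean of each variable.
neigh_means = (
    toy_airbnb.groupby("neighborhood")
    .agg(avg_price=("price", "mean"), score=("avg_score", "mean"))
    .reset_index()
)
print(neigh_means)
```

Each row of `neigh_means` is then one observation in the regression, which is why the tests below run on neighborhoods instead of on individual listings.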

In [30]:
plt.figure(figsize=(10,8))
sns.barplot(x="neighborhood", y="score", data=neighbor_data)
plt.xticks(rotation=60, ha='right')
plt.title('Average score in neighborhood', fontsize=12)
plt.show()

2.1 Relationship between Airbnb score and price

In [18]:
plt.figure(figsize=(5,4))
sns.scatterplot(x="avg_price", y="score", data=test_data)
plt.title('Average Price vs Average Score', fontsize=12)
plt.show()
In [19]:
reg = stats.linregress(test_data['avg_price'], test_data['score'])
print("p-value of linear regression slope test = ",reg.pvalue)
p-value of linear regression slope test =  0.0033618882578678863
Conclusion: the p-value of the regression slope test is smaller than 0.05. Therefore, we can reject the null hypothesis and conclude that there is a positive linear relationship between the price and the score of Airbnbs. It seems that a higher price can lead to a higher score.

2.2 Relationship between Airbnb score and number of sustenance amenities nearby

In [20]:
# Without the central limit theorem: plot each individual Airbnb listing.
# We can hardly tell anything from this plot.
plt.figure(figsize=(5,4))
sns.scatterplot(x="sustenance", y="avg_score", data=airbnb)
plt.title('Number of sustenance vs Average Score', fontsize=12)
plt.show()
In [21]:
# With the central limit theorem: plot the per-neighborhood means.
plt.figure(figsize=(5,4))
sns.scatterplot(x="sustenance", y="score", data=test_data)
plt.title('Number of sustenance vs Average Score', fontsize=12)
plt.show()
In [22]:
reg = stats.linregress(test_data['sustenance'], test_data['score'])
print("p-value of linear regression slope test = ",reg.pvalue)
p-value of linear regression slope test =  0.012764052888733137
Conclusion: the score of an Airbnb tends to be higher when there are more sustenance amenities nearby.

2.3 Relationship between Airbnb score and number of transportation amenities nearby

In [23]:
plt.figure(figsize=(5,4))
sns.scatterplot(x="transportation", y="score", data=test_data)
plt.title('Number of transportation vs Average Score', fontsize=12)
plt.show()
In [24]:
# Regress score on the transportation count (not on sustenance).
reg = stats.linregress(test_data['transportation'], test_data['score'])
print("p-value of linear regression slope test = ",reg.pvalue)
Conclusion: we can hardly see a linear pattern between these two variables in the scatter plot, so a p-value below 0.05 on its own would not be convincing here. Therefore, we do not conclude that there is a linear relationship between the score and the number of transportation amenities nearby.